Targeted Gene Metagenomic Data Analysis    ◾    255

how the reads and barcodes are stored. The reads also might be sequenced as single end or

paired end. The paired-end reads are usually stored in two FASTQ files.

7.2.2  Metagenomic Features

After the preprocessing, the raw reads of the targeted gene (16S rRNA) are dereplicated

by reducing the abundance of the similar reads based on a Hamming distance threshold.

The dereplicated reads are then clustered by collapsing the complete sets of reads into a

grouping units called operational taxonomic units (OTUs) [5]. Once the OTUs have been

formed, representative sequences (features) can either be selected from those OTUs or the

sequences of each OTUs are denoised to remove potential errors before selecting represen-

tative sequences. In the following, we will discuss both clustering and denoising.

7.2.2.1  Clustering

There are three approaches for clustering the preprocessed amplicon reads. The first is the

de novo clustering [6] in which the reads are clustered into OTUs according to their simi-

larity at a specified threshold. It compares each read against each other and then it groups

sequences into OTUs. The second is called open-reference clustering in which reads are

clustered around previously annotated reference sequences by sequence classification or

searching in a database (Greengenes database) and the reads that do not cluster around

reference sequences are clustered with de novo method. The third is the closed-reference

clustering in which reads are clustered around reference sequences and the reads that do

not find reference sequences are removed [7]. The closed-reference clustering may intro-

duce reference bias to the clustering process since the reads without references will be

removed, while they may represent novel species or taxa. The de novo clustering is pre-

ferred, although it may be computationally expensive other than the other two clustering

methods. The algorithms for de novo clustering include hierarchical clustering and heu-

ristic clustering. The hierarchical clustering method requires a matrix for the pairwise dis-

tances between all unique sequences. A hierarchical tree is constructed from the distance

matrix, and then the OTUs are constructed from the tree based on a distance threshold.

The heuristic clustering uses pairwise sequence alignment to construct a distance

matrix one by one instead of computing all distances in a single step. The heuristic cluster-

ing method generates OTUs in a greedy incremental strategy that utilizes one sequence as

a seed to represent a cluster. Then, each one of the unique reads is compared with the seeds

of existing clusters. A read is assigned to a cluster if the pairwise distance between a unique

read and a seed meets a predefined threshold. If a read does not join any cluster, that read

will become a seed for a new cluster. The final clusters represent the OTUs. The heuristic

clustering method is more computationally efficient than the hierarchical clustering. There

are a variety of heuristic clustering methods that vary in seed selection and distance cal-

culation. Examples of the tools that use heuristic clustering for constructing OTUs include

USEARCH and CD-HIT.

In general, an OTU constructed by any of the clustering methods is considered as a tax-

onomic group. From this point, the analysis can move on to the taxonomic group assign-

ment, construction of the phylogenetic tree, and diversity analysis but the latest analytic